Load and preprocess the adult data as before, including dummy encoding and scaling. Learn a logistic regression model and visualize the coefficients. Then grid-search the regularization parameter C and compare the L1 penalty to the L2 penalty. How do the coefficients differ? Which features are the most important?
In [ ]:
import pandas as pd
# The file has no headers naming the columns, so we pass header=None
# and provide the column names explicitly in "names"
data = pd.read_csv(
    "adult.data", header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
# fnlwgt is a census sampling weight, not a property of the individual,
# so it is not a meaningful feature in this context
data = data.drop("fnlwgt", axis=1)
data.head()
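
One possible way to approach the remaining steps is sketched below. It is not a reference solution. The helper `compare_penalties` and the small random `toy` frame are hypothetical names introduced only for illustration; the toy frame is a stand-in so the cell runs without `adult.data`, and its values mean nothing. The sketch assumes the `liblinear` solver (which supports both penalties) and an arbitrary grid of six values for C. Scaling sits inside the pipeline so that it is refit on each cross-validation split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_penalties(data, target="income", cv=5):
    """Grid-search C for L1 and L2 logistic regression.

    Hypothetical helper: returns one column of coefficients per penalty,
    indexed by the dummy-encoded feature names.
    """
    # Dummy-encode the categorical columns; numeric columns pass through.
    X = pd.get_dummies(data.drop(columns=target), dtype=float)
    y = data[target]
    # Assumed grid: 0.001, 0.01, ..., 100
    param_grid = {"logisticregression__C": np.logspace(-3, 2, 6)}
    coefs = {}
    for penalty in ["l1", "l2"]:
        pipe = make_pipeline(
            StandardScaler(),
            # liblinear handles both the L1 and the L2 penalty
            LogisticRegression(penalty=penalty, solver="liblinear"),
        )
        grid = GridSearchCV(pipe, param_grid, cv=cv)
        grid.fit(X, y)
        best_model = grid.best_estimator_.named_steps["logisticregression"]
        coefs[penalty] = best_model.coef_.ravel()
    return pd.DataFrame(coefs, index=X.columns)


# Toy stand-in data (random, illustration only), NOT the adult dataset.
rng = np.random.RandomState(0)
n = 300
toy = pd.DataFrame({
    "age": rng.randint(18, 70, size=n),
    "hours-per-week": rng.randint(10, 60, size=n),
    "workclass": rng.choice(["Private", "State-gov"], size=n),
})
# The toy target depends on age plus noise.
toy["income"] = np.where(
    toy["age"] + rng.normal(0, 10, size=n) > 45, ">50K", "<=50K")

coefs = compare_penalties(toy, cv=3)
print(coefs)
```

On the real data, the call would be `compare_penalties(data)`. Because the features are standardized, the magnitudes of the coefficients are roughly comparable across features, so sorting by absolute value gives a first impression of importance. The L1 penalty can set coefficients exactly to zero, which is worth checking with `(coefs["l1"] == 0).sum()`. If matplotlib is installed, `coefs.plot.barh()` is one way to visualize the two columns side by side.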